Capital One Data Challenge - Airbnb & Zillow Case Study

Overview

Business Problem

A real estate company plans to purchase properties in New York to rent out short-term as part of their business model. The real estate company has already concluded that two bedroom properties are the most profitable; however, the company doesn’t know which zip codes are the best to invest in.

The objective of this Data Challenge is to analyze the Airbnb and Zillow datasets and suggest the most profitable zipcodes in New York where 2-bedroom properties can be purchased and rented out short-term.

Datasets

Zillow dataset (cost data): Provides selling cost for 2 bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017.

Airbnb dataset (revenue data): Information about property listings in New York, including location, number of bedrooms, reviews, price, availability, property description, etc. Airbnb is the medium through which the real estate company plans to lease out its investment property.

Assumptions

  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
  • All properties and all square feet within each locale can be assumed to be homogeneous.
  • Occupancy rate of 75% throughout the year for Airbnb properties.
  • The company will put properties on rent every day throughout the year.
  • We assume that a booking usually lasts 3 days (a conclusion drawn from a study by Pillow.com), so we consider Revenue Per Day to be Price + (Cleaning Fee / 3).
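To make the per-day revenue assumption concrete, here is a tiny worked example (the $200 nightly price and $90 cleaning fee are invented numbers, not taken from the data):

```r
# Hypothetical listing: $200/night price, $90 cleaning fee, assumed 3-night stay
price        <- 200
cleaning_fee <- 90
revenue_per_day <- price + cleaning_fee / 3
revenue_per_day
## [1] 230
```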

Process

Data Loading: Loading the required dataset

Data Quality: Real-world datasets contain several inconsistencies that will be dealt with, and the data needs to be filtered for NY.

Data Munging: This section concentrates on linking and making the data homogeneous in terms of units, in a scalable manner.

Visual Analysis: This section provides graphs to help us analyze the most profitable zipcodes.

Conclusions: This section concentrates on providing further business insights into how the client can achieve greater profits by employing the suggested strategies.

Note

Selling cost information for 2-bedroom properties is available only until June 2017. I perform analysis to fill in the missing cost values by imputing them using LOCF and NOCB (last observation carried forward and next observation carried backward), two standard ways to achieve such imputations. This will help if we get a specific time stamp for the Airbnb data.

I did not forecast the cost to a time point further out in 2019.

Packages Used

Loading Required Packages
# Store required packages in an array
pkgs <- c("data.table","dplyr","ggplot2","tidyr","naniar",
          "GGally","Matrix","plotly","ggthemes")

# If package is not present, install the packages
for (pkg in pkgs) {
  if (!(pkg %in% installed.packages()[, "Package"])) {
    install.packages(pkg)
  }
}
library("data.table")
library("dplyr")
library("ggplot2")
library("tidyr")
library("ggthemes")
library("GGally")
library("Matrix")
library("plotly")
library("naniar")

Data Loading

Declarations

The Airbnb dataset to be loaded is listings.csv and the Zillow dataset is Zip_Zhvi_2bedroom.csv.

airbnbDir <- "listings.csv"          # path to the Airbnb listings csv
zillowDir <- "Zip_Zhvi_2bedroom.csv" # path to the Zillow 2-bedroom cost csv
Reading From Files

Loading the files and replacing empty cell values with NA, which helps with data cleaning later on

airbnb <- fread(airbnbDir,header = TRUE, sep = "," , stringsAsFactors = FALSE, na.strings = c("","NA"))

zillow <- fread(zillowDir,header = TRUE, sep = "," , stringsAsFactors = FALSE, na.strings = c("","NA"))
Functions

Last observation carried forward (LOCF) imputation function: this helps us fill missing historical cost values in the Zillow data

na.locf.default <- function(x) {
  v <- !is.na(x)
  return(c(NA, x[v])[cumsum(v) + 1])
}
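A quick sanity check of the helper on a toy vector (the values are made up). Note that NOCB falls out for free: reverse the vector, carry forward, and reverse back.

```r
# LOCF helper as defined above
na.locf.default <- function(x) {
  v <- !is.na(x)
  return(c(NA, x[v])[cumsum(v) + 1])
}

x <- c(NA, 3, NA, NA, 7, NA)
na.locf.default(x)            # LOCF: NA 3 3 3 7 7
rev(na.locf.default(rev(x)))  # NOCB: 3 3 7 7 7 NA
```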

Data Quality

Checks for Zillow

1. Missing Counts
ds0<- data.frame(MissingCounts= apply(zillow,2, function (x) sum(is.na(x))/nrow(zillow)*100 ))
summary(ds0$MissingCounts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.9474 13.8721 11.3655 17.2815 29.7563

Missingness is reasonable for the overall dataset. The house price columns have missing values at some time snapshots, but that may not be an issue if the missings are random across time, since those can be handled with imputations.

2. Missing Counts Across Time, Random or Not?
ds1 <- rowSums(zillow[, c(8:262)],na.rm = TRUE)
summary(ds1)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   2109800  19103100  29690900  36939674  44956950 391390500

All row sums are positive, which means we have some cost value for each zipcode at some point in time; hence this is not a concern.

3. Non-Negative Prices
ds2<- apply(zillow[, c(8:262)],1,min,na.rm = TRUE)
summary(ds2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22400   60800   82100   98650  112575 1428000

Result: As expected, prices are non-negative

4. Filtering Data for New York
zillow_filtered <- zillow[which(zillow$State =="NY"),]
nrow(zillow_filtered)
## [1] 475
5. Changing Column Name to Zipcode

Changing the Zillow column name "RegionName" to "zipcode"; this will help with merging later

setnames(zillow_filtered,"RegionName", "zipcode")

Checks for Airbnb

dim(airbnb)
## [1] 48895   106
1. State Corrections
table(airbnb$state)
## 
##                CA       MP New York       NJ       ny       Ny       NY 
##        3        2        1        1        2        3        7    48873
airbnb$state <- (gsub("New York","NY",airbnb$state))
airbnb$state <- (gsub("ny","NY",airbnb$state))
airbnb$state <- (gsub("Ny","NY",airbnb$state))
2. Filtering Data for New York and 2 Bedroom Properties
airbnb_filtered <- airbnb[which(airbnb$state=="NY" & airbnb$bedrooms == 2),]
nrow(airbnb_filtered)
## [1] 6496
3. Changing Zipcode Column Type

Changing the column type to character for easier merging and uniform structure

zillow_filtered$zipcode <- as.character(zillow_filtered$zipcode)
airbnb_filtered$zipcode <- as.character(airbnb_filtered$zipcode)

Data Munging

Covers the data manipulations and new feature creation needed for a scalable and accurate approach that can handle larger data spanning multiple revenue snapshots (Airbnb).

Please Note: Currently the Airbnb data is available for a single snapshot in 2019. The real-world problem can involve multiple time stamps, so while merging datasets we need to match cost and revenue by zipcode as well as by the snapshot for which they apply.

Derive insights from the raw and derived fields

Zillow Data From Wide to Long Format
# melt zillow_filtered so each row holds the price for one zipcode-time point;
# this allows merging price and cost accurately at the zip and time level.
zillow_filLong <- melt(zillow_filtered,id=c("RegionID","zipcode","City","State","Metro","CountyName","SizeRank"))
setnames(zillow_filLong,"variable", "time")
setnames(zillow_filLong,"value", "cost")

# summary of cost in long data
summary(zillow_filLong$cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32700   74900  109100  175505  194400 3347100   13985
# filter relevant columns from zillow_filLong
zillow_filLong<- zillow_filLong[,c("zipcode", "time", "cost")]
Imputations To Property Prices

Why do this? It helps scale the analysis over a longer window of data, capturing house prices at different times and aptly matching them to rentals by merging on a combination of zipcode and monthYear.

LOCF and NOCB (last observation carried forward and next observation carried backward) are two standard ways to achieve such imputations.

zillow_filLong <- zillow_filLong[order(zipcode,time)]
zillow_filLong <- zillow_filLong[, cost:= na.locf.default(cost), by=.(zipcode)]
zillow_filLong <- zillow_filLong[order(zipcode,-time)]
zillow_filLong <- zillow_filLong[, cost:= na.locf.default(cost), by=.(zipcode)]
summary(zillow_filLong$cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32700   75400  109900  183091  200400 3347100

Great! Now we have cost points for all times and zipcodes.

Airbnb Data Dates Investigation
table(airbnb_filtered$last_scraped,useNA = "ifany")
## 
## 2019-07-08 2019-07-09 
##       4738       1758
table(airbnb_filtered$calendar_last_scraped,useNA = "ifany")
## 
## 2019-07-08 2019-07-09 
##       4738       1758
table(airbnb_filtered$calendar_updated,useNA = "ifany")
## 
##    1 week ago 10 months ago 11 months ago 12 months ago 13 months ago 
##            52            42            51            46            40 
## 14 months ago 15 months ago 16 months ago 17 months ago 18 months ago 
##            43            39            41            53            37 
## 19 months ago    2 days ago  2 months ago   2 weeks ago 20 months ago 
##            63            79           376           507            42 
## 21 months ago 22 months ago 23 months ago 24 months ago 25 months ago 
##            19            20            43            43            45 
## 26 months ago 27 months ago 28 months ago 29 months ago    3 days ago 
##            27            31            24            25           199 
##  3 months ago   3 weeks ago 30 months ago 31 months ago 32 months ago 
##           289           342            22            51            24 
## 33 months ago 34 months ago 35 months ago 36 months ago 37 months ago 
##            21            29            29            39            35 
## 38 months ago 39 months ago    4 days ago  4 months ago   4 weeks ago 
##            26            19           112           170           252 
## 40 months ago 41 months ago 42 months ago 43 months ago 44 months ago 
##            15            22            20            47            29 
## 45 months ago 46 months ago 47 months ago 48 months ago 49 months ago 
##            23            15            18            11            10 
##    5 days ago  5 months ago   5 weeks ago 50 months ago 53 months ago 
##           161           139           220             3             3 
## 54 months ago 55 months ago 56 months ago 57 months ago 58 months ago 
##             1             4             5             1             1 
##    6 days ago  6 months ago   6 weeks ago 60 months ago 61 months ago 
##            70           108           179             3             1 
## 64 months ago 66 months ago 67 months ago 68 months ago  7 months ago 
##             1             7             1             1           114 
##   7 weeks ago 74 months ago  8 months ago  9 months ago    a week ago 
##           105             1            72            41           421 
##         never         today     yesterday 
##            15           941           220

Availability updates are scattered over time, though the scraping information is quite recent. In the Airbnb data we have the data from the last pull, so we assume the prices shown are active at the current time.

Further, since the cost/house-price data is only available until June 2017, we assume it stays the same through 2019.

Ideally, we might want to fetch 2017 prices from Airbnb to match the timelines of cost and revenue, but that would cost the analysis its recency, since it would reflect a two-year-old scenario that may have changed since. Alternatively, we could use forecasting techniques like ARIMA to arrive at 2019 costs.

# Recent data most applicable.
zillow_filLong <- zillow_filLong[time=="2017-06"]
Merging Datasets Using Zipcode
# merge the two datasets by zipcode, since a single time overlap is available
merge_data <- merge(airbnb_filtered, zillow_filLong , by = "zipcode" )
Selecting Relevant Columns
relevantColumns <- c("zipcode","street","neighbourhood_group_cleansed","latitude","longitude","square_feet",
                     "cost","price","cleaning_fee","minimum_nights",
                     "maximum_nights","number_of_reviews","review_scores_rating",
                     "time","property_type","room_type","bedrooms")

merge_data <- merge_data[,relevantColumns,with=F]
Formatting Prices

Formatting the price and cleaning_fee columns to remove the $ symbol and commas, then converting their type to numeric

merge_data$price <- (gsub("\\$","",merge_data$price))
merge_data$price <- (gsub("\\,","",merge_data$price))
merge_data$cleaning_fee <- (gsub("\\$","",merge_data$cleaning_fee))
merge_data$cleaning_fee <- (gsub("\\,","",merge_data$cleaning_fee))

merge_data$cleaning_fee <- as.numeric(merge_data$cleaning_fee)
merge_data$price <- as.numeric(merge_data$price)
Missing Value Analysis
gg_miss_var(merge_data, show_pct = TRUE) + labs(y = "Percentage")

Imputing Missing Values

Impute missing values in the cleaning_fee column: from the summary, the mean, median, and mode are almost the same, hence the mean is chosen

summary(merge_data$cleaning_fee)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    75.0   100.0   109.8   150.0   482.0     211
merge_data <- merge_data[, cleaning_fee:= ifelse(is.na(cleaning_fee),mean(cleaning_fee,na.rm=TRUE),cleaning_fee)]

summary(merge_data$cleaning_fee) # All NA's were imputed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    75.0   100.0   109.8   134.0   482.0
Correct Price For Room Type

The daily rental price in the revenue data reflects the space on offer, not necessarily the entire property. The price must be corrected to account for the whole property in order to capture the full benefit. Assumption made: if room_type == "Private room", the price is multiplied by the number of bedrooms to approximate the whole-property price. The correction is written back to the original price column.

merge_data <- merge_data[, price:=ifelse(room_type == "Private room",price * bedrooms, price)]
Revenue Per Day

We are assuming an average stay of 3 days (inferred from a study by Pillow.com), so effectively a share of the cleaning fee adds to the revenue every day. This variable will be used later in calculating other variables for visualisations.

merge_data[, revenuePerDay:= price+cleaning_fee/3]
Outlier Treatment

While outlier treatment is very useful in regression analysis, in a case like this its usefulness is limited to within-group (zipcode) comparisons; here revenuePerDay and cost are capped at their 1st and 99th percentiles.

outliers <- quantile(merge_data$revenuePerDay,probs = c(0.01,0.99))

merge_data[, revenuePerDay:= ifelse(revenuePerDay >=outliers[2],outliers[2],revenuePerDay)]
merge_data[, revenuePerDay:= ifelse(revenuePerDay <=outliers[1],outliers[1],revenuePerDay)]
summary(merge_data$revenuePerDay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   103.0   200.3   272.7   324.1   376.7  1481.2
outliers <- quantile(merge_data$cost,probs = c(0.01,0.99))

merge_data[, cost:= ifelse(cost >=outliers[2],outliers[2],cost)]
merge_data[, cost:= ifelse(cost <=outliers[1],outliers[1],cost)]
summary(merge_data$cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  382300 1302300 1712900 1791108 2147000 3316500
Summary
summary(merge_data)
##    zipcode             street          neighbourhood_group_cleansed
##  Length:1565        Length:1565        Length:1565                 
##  Class :character   Class :character   Class :character            
##  Mode  :character   Mode  :character   Mode  :character            
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##     latitude       longitude       square_feet          cost        
##  Min.   :40.52   Min.   :-74.21   Min.   :   0.0   Min.   : 382300  
##  1st Qu.:40.68   1st Qu.:-74.00   1st Qu.: 650.0   1st Qu.:1302300  
##  Median :40.73   Median :-73.99   Median :1000.0   Median :1712900  
##  Mean   :40.73   Mean   :-73.98   Mean   : 902.3   Mean   :1791108  
##  3rd Qu.:40.76   3rd Qu.:-73.97   3rd Qu.:1125.0   3rd Qu.:2147000  
##  Max.   :40.81   Max.   :-73.72   Max.   :1600.0   Max.   :3316500  
##                                   NA's   :1538                      
##      price         cleaning_fee   minimum_nights   maximum_nights    
##  Min.   :  50.0   Min.   :  0.0   Min.   :  1.00   Min.   :       1  
##  1st Qu.: 175.0   1st Qu.: 75.0   1st Qu.:  2.00   1st Qu.:      30  
##  Median : 239.0   Median :100.0   Median :  3.00   Median :    1125  
##  Mean   : 294.2   Mean   :109.8   Mean   : 10.14   Mean   :   13463  
##  3rd Qu.: 338.0   3rd Qu.:134.0   3rd Qu.:  7.00   3rd Qu.:    1125  
##  Max.   :4000.0   Max.   :482.0   Max.   :365.00   Max.   :20000000  
##                                                                      
##  number_of_reviews review_scores_rating      time      property_type     
##  Min.   :  0.00    Min.   : 20.00       2017-06:1565   Length:1565       
##  1st Qu.:  1.00    1st Qu.: 92.00       1996-04:   0   Class :character  
##  Median :  4.00    Median : 96.00       1996-05:   0   Mode  :character  
##  Mean   : 19.78    Mean   : 94.15       1996-06:   0                     
##  3rd Qu.: 17.00    3rd Qu.:100.00       1996-07:   0                     
##  Max.   :403.00    Max.   :100.00       1996-08:   0                     
##                    NA's   :387          (Other):   0                     
##   room_type            bedrooms revenuePerDay   
##  Length:1565        Min.   :2   Min.   : 103.0  
##  Class :character   1st Qu.:2   1st Qu.: 200.3  
##  Mode  :character   Median :2   Median : 272.7  
##                     Mean   :2   Mean   : 324.1  
##                     3rd Qu.:2   3rd Qu.: 376.7  
##                     Max.   :2   Max.   :1481.2  
## 
Glimpse
glimpse(merge_data)
## Observations: 1,565
## Variables: 18
## $ zipcode                      <chr> "10003", "10003", "10003", "10003",…
## $ street                       <chr> "New York, NY, United States", "New…
## $ neighbourhood_group_cleansed <chr> "Manhattan", "Manhattan", "Manhatta…
## $ latitude                     <dbl> 40.72577, 40.72939, 40.73476, 40.73…
## $ longitude                    <dbl> -73.98745, -73.98857, -73.98452, -7…
## $ square_feet                  <int> NA, 1100, NA, 1000, NA, NA, 800, NA…
## $ cost                         <dbl> 2147000, 2147000, 2147000, 2147000,…
## $ price                        <dbl> 249, 189, 400, 245, 549, 151, 240, …
## $ cleaning_fee                 <dbl> 85.0000, 115.0000, 120.0000, 200.00…
## $ minimum_nights               <int> 5, 2, 2, 5, 30, 2, 5, 2, 5, 30, 2, …
## $ maximum_nights               <int> 365, 28, 8, 1125, 365, 28, 365, 112…
## $ number_of_reviews            <int> 166, 403, 105, 18, 58, 214, 86, 66,…
## $ review_scores_rating         <int> 85, 86, 94, 96, 94, 84, 96, 97, 100…
## $ time                         <fct> 2017-06, 2017-06, 2017-06, 2017-06,…
## $ property_type                <chr> "Apartment", "Apartment", "Apartmen…
## $ room_type                    <chr> "Entire home/apt", "Entire home/apt…
## $ bedrooms                     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ revenuePerDay                <dbl> 277.3333, 227.3333, 440.0000, 311.6…

Final Analysis and Visualizations

1. Revenue To Cost Ratio

In this approach, we calculate the Revenue/Cost ratio for the first year, using our assumption that an average stay lasts 3 days, so a share of the cleaning fee effectively adds to the revenue every day. We also assume a 75% occupancy rate throughout the year.

We calculate this ratio at the record level and then aggregate it at the zipcode level; this helps us identify zipcodes that have better returns.

merge_data[,revenueToCostRatio:= (revenuePerDay*365*.75)/cost * 100]

merge_data_summary<- merge_data[, .(averageRevenueToCostRatio=mean(revenueToCostRatio), AvgCost= mean(cost), count= length(cost)), by=.(zipcode)]

rentalReturn<- ggplot(merge_data_summary, aes(x = AvgCost, y = averageRevenueToCostRatio)) +geom_point(aes(colour=zipcode, size= count)) + labs(y = "Average Revenue To Cost Ratio", x = "Average Cost") + scale_x_continuous(labels = scales::comma) + theme_clean()
ggplotly(rentalReturn)
  • Lower cost and the highest returns are found toward the leftmost corner, establishing that the highest-valued properties can be avoided since they don't lead to higher returns.
  • The size of a point signifies how many properties the corresponding zipcode has; we can see that 10025 has the best ratio along with a significant number of properties.
  • We need to be cautious of low counts in some neighbourhoods: the evidence there could be statistically insignificant or prone to variability, or reflect a non-statistical cause such as a less conducive environment.
Top Zipcodes Based On Revenue Ratio
merge_data_summary <- merge_data_summary[order(-averageRevenueToCostRatio)]
topRentalReturn <- head(unique(merge_data_summary$zipcode),15)
print(topRentalReturn)
##  [1] "10308" "11003" "11434" "10305" "10306" "11234" "10303" "10314"
##  [9] "10304" "10309" "10025" "10036" "11231" "11215" "11201"

2. Years to Profit Analysis (Payback Period)

The first approach considered the revenue (rent) generated from the property relative to its cost in a year. In this approach, we assume the property is held indefinitely and calculate the number of years it will take to break even, i.e. the number of years needed to recover the initial investment.

Setting our variables (we assume an occupancy rate of 75% throughout the year here)

merge_data[,totalAnnualIncome:= revenuePerDay*365*.75]
merge_data[,yearsToProfit:= cost/totalAnnualIncome]
merge_data_year_return<- merge_data[, .(averageYearsToProfit=mean(yearsToProfit), AvgCost= mean(cost)), by=.(zipcode)]

yearReturn<- ggplot(merge_data_year_return, aes(x = reorder(zipcode, -averageYearsToProfit), y = averageYearsToProfit)) + geom_bar(stat="identity", width = .3) + labs(y = "Average Years to Profit", x = "Zip Code") + coord_flip() + theme_clean()
ggplotly(yearReturn,tooltip = c("zipcode","averageYearsToProfit"))

As we can see, the zipcodes at the top are the quickest to reach their breakeven point.
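To make the payback arithmetic concrete, here is a toy example with invented numbers (a $900,000 property earning $300 revenue per day at the assumed 75% occupancy):

```r
# Hypothetical property, not taken from the data
cost          <- 900000
revenuePerDay <- 300
totalAnnualIncome <- revenuePerDay * 365 * 0.75   # 82,125 per year
yearsToProfit     <- cost / totalAnnualIncome
round(yearsToProfit, 1)
## [1] 11
```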

Top Zipcodes Based On Payback Period
merge_data_year_return <- merge_data_year_return[order(averageYearsToProfit)]
topYearReturn <- head(unique(merge_data_year_return$zipcode),15)
print(topYearReturn)
##  [1] "10308" "11003" "11434" "10306" "10305" "11234" "10303" "10314"
##  [9] "10304" "10309" "11215" "10036" "11201" "10025" "11231"

3. Property Appreciation Analysis

Property Appreciation or Real estate appreciation is a simple concept. It refers to how the value of an investment property increases with time.

As we know, property itself can appreciate/depreciate in value, this helps in understanding which properties would be a better investment.

Creating a dataset with average prices across years and zipcodes (long format)

zillow_filLong2 <- melt(zillow_filtered,id=c("RegionID","zipcode","City","State","Metro","CountyName","SizeRank"))
zillow_filLong2[, year:=substr(variable,1,4) ]
annual_cost <- zillow_filLong2[,.(avg_value=mean(value,na.rm=TRUE)) , by=.(year, zipcode)]

h<-as.data.table(table(is.nan(annual_cost$avg_value),annual_cost$year))
h<- h[order(-N)]
print(head(h,20))
##        V1   V2   N
##  1: FALSE 2011 475
##  2: FALSE 2012 475
##  3: FALSE 2013 475
##  4: FALSE 2014 475
##  5: FALSE 2015 475
##  6: FALSE 2016 475
##  7: FALSE 2017 475
##  8: FALSE 2010 474
##  9: FALSE 2009 425
## 10: FALSE 2007 423
## 11: FALSE 2008 423
## 12: FALSE 2006 421
## 13: FALSE 2005 419
## 14: FALSE 2004 413
## 15: FALSE 2003 399
## 16: FALSE 2002 391
## 17: FALSE 2001 389
## 18: FALSE 2000 387
## 19: FALSE 1999 376
## 20: FALSE 1998 375
# no missings since 2011 
annual_cost <- annual_cost[year>=2011,]

annual_cost <- merge(annual_cost, unique(airbnb_filtered[,c("zipcode","neighbourhood_group_cleansed"),with=F]))

# plotting the results
costPlot <-   ggplot(annual_cost,aes(x = year,
                                     y = avg_value,
                                     group = zipcode,
                                     colour = neighbourhood_group_cleansed
)) + labs(y = "Average Value", x = "Year") + geom_line() + geom_point() + scale_y_continuous(labels = scales::comma) + theme_clean()
ggplotly(costPlot)

There are steep lines across different price ranges, so a more meaningful analysis is to compute the return on prices over time.

Rate of Return on property from Price Growth over 2011-17

# create a dataset with avg house prices across years and zipcodes (wide format)
annual_cost_wide <- dcast.data.table(zipcode+neighbourhood_group_cleansed ~ year, data= annual_cost, value.var = "avg_value")
annual_cost_wide[, growthRate:= (`2017`/`2011`-1)*100]
annual_cost_wide$zipcode <- as.character(annual_cost_wide$zipcode)

growthPlot<- ggplot(annual_cost_wide, aes(x = `2017`, y = growthRate, colour=zipcode)) + geom_point()+ scale_y_continuous(labels = scales::comma) +scale_x_continuous(labels = scales::comma)+ labs(y = "Growth Rate", x = "2017 Prices") + theme_clean()
ggplotly(growthPlot)

This shows that mid-range-priced property zones offer the highest return in terms of property valuation.

Top Zipcodes By Property Appreciation

annual_cost_wide<- annual_cost_wide[order(-growthRate)]
# check if these zipcodes are available in the revenue data
topGrowing <- unique(annual_cost_wide$zipcode[annual_cost_wide$zipcode %in% merge_data$zipcode])
topGrowing <- head(topGrowing,15)
print(topGrowing)
##  [1] "11217" "11231" "11201" "11215" "10128" "10011" "10025" "10014"
##  [9] "10028" "11434" "10003" "10023" "10013" "10021" "10036"

A combination of revenue and property price growth potential has to be considered when arriving at a decision.

topRev <- intersect(topYearReturn,topRentalReturn)
print(topRev)
##  [1] "10308" "11003" "11434" "10306" "10305" "11234" "10303" "10314"
##  [9] "10304" "10309" "11215" "10036" "11201" "10025" "11231"
topRevGrowing <- intersect(topRev,topGrowing)
print(topRevGrowing)
## [1] "11434" "11215" "10036" "11201" "10025" "11231"

Secondary Criteria

4. Budget Constraint

A lower-priced property that ranks high on the revenue criteria is preferred, since it lowers the potential funding cost and the investment at risk.

merge_data_summary_sec <- merge_data_summary[zipcode %in% topRev]

cheapPricePlot<- ggplot(merge_data_summary_sec, aes(x = averageRevenueToCostRatio, y = AvgCost, colour=zipcode)) + geom_point()+ scale_y_continuous(labels = scales::comma) +scale_x_continuous(labels = scales::comma) + labs(y = "Avg Price", x = "Average Revenue to Cost Ratio") + theme_clean()

ggplotly(cheapPricePlot)
merge_data_summary_sec<- merge_data_summary_sec[order(AvgCost)]
topCheapBuys <- head(unique(merge_data_summary_sec$zipcode),15)
print(topCheapBuys)
##  [1] "11003" "11434" "10306" "10303" "10314" "10304" "10309" "10308"
##  [9] "10305" "11234" "11215" "11231" "11201" "10025" "10036"

Conclusion

Top Zipcodes Based on Our Analysis

zipCodeSFinal <- intersect(topRevGrowing,topCheapBuys)
print(zipCodeSFinal)
## [1] "11434" "11215" "10036" "11201" "10025" "11231"

Top Zipcode Choices:

  • Zipcode 10036 in Manhattan shows up in all 3 criteria we used to measure the best zipcodes; it is a mid-size investment that will yield high returns and has a good property growth rate.
  • Zipcode 10025 in Manhattan is not far behind: it is also present in all 3 measures and is cheaper than 10036 with a better growth rate, making it the best property to buy in Manhattan.

  • Zipcode 11201 in Brooklyn is a great option; it performed similarly to the aforementioned zipcodes in all analyses and had the 3rd-highest growth rate.
  • Zipcode 11215 in Brooklyn is very similar to 11201 in performance, but it is cheaper than the aforementioned.
  • Zipcode 11231 in Brooklyn closely resembles 11215 in that the average cost of a 2-bedroom property and the average annual revenue per property are very close.

  • Zipcode 11434 in Queens breaks even the quickest of the lot recommended here, since its properties are cheaper; the only downside is a below-average growth rate.

Future Steps

  1. Use ARIMA to predict cost values after June 2017.
  2. Introduce seasonality and weather data to understand occupancy trends throughout the year.
  3. Use statistical methods to understand occupancy and availability better.
  4. Due to time constraints and the scope of this project, some coding practices such as memory management, variable nomenclature, and other markdown-specific functionalities were ignored. Addressing these would be the automatic first step in future work.
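As a rough sketch of the ARIMA idea mentioned above, base R's stats::arima could be fit per zipcode and projected forward; the series below is simulated stand-in data with an invented drift, not the Zillow prices, and the (1,1,0) order is an arbitrary illustrative choice rather than a tuned model.

```r
# Simulated monthly price series standing in for one zipcode (not real data)
set.seed(42)
prices <- ts(500000 + cumsum(rnorm(126, mean = 1500, sd = 3000)),
             start = c(2007, 1), frequency = 12)   # Jan 2007 - Jun 2017

fit  <- arima(prices, order = c(1, 1, 0))  # simple ARIMA(1,1,0) on the series
pred <- predict(fit, n.ahead = 24)         # project 24 months, i.e. to mid-2019
pred$pred[24]                              # point forecast for June 2019
```

In practice one would select the order per zipcode (e.g. by AIC) rather than fixing it.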